Scalable system for clustering of large databases having mixed data attributes

ABSTRACT

One exemplary embodiment of a scalable clustering algorithm accesses a database of records having attributes or data fields of both enumerated discrete and ordered values and brings a portion of the data records into a rapid access memory. A cluster model for the data includes a table of probabilities for the enumerated, discrete data fields of the data records. The cluster model for data fields that are ordered comprises a mean and spread of the cluster. The cluster model is updated from the database records brought into the rapid access memory. At least some of the database records in the rapid access memory are summarized and stored within the rapid access memory. A criterion is then evaluated to determine if further data should be accessed from the database to further cluster data records in the database. Based on the evaluating step, additional database records in the database are accessed and brought into the rapid access memory for further updating of the cluster model.

CROSS REFERENCE TO RELATED APPLICATIONS

The present invention is a continuation in part of Ser. No. 09/083,906, U.S. Pat. No. 6,263,337, entitled “A Scalable System for Expectation-Maximization Clustering of Large Databases” to Fayyad et al., filed May 22, 1998, issued Jul. 17, 2001, which is assigned to the assignee of the present application and is incorporated herein by reference.

Priority is claimed from provisional U.S. patent application Ser. No. 60/086,410 filed May 22, 1998.

FIELD OF THE INVENTION

The present invention concerns database analysis and more particularly concerns an apparatus and method for clustering of data into groups that capture important regularities and characteristics of the data.

BACKGROUND ART

Large data sets are now commonly used in most business organizations. In fact, so much data has been gathered that asking even a simple question about the data has become a challenge. The modern information revolution is creating huge data stores which, instead of offering increased productivity and new opportunities, can overwhelm the users with a flood of information. Tapping into large databases for even simple browsing can result in a return of irrelevant and unimportant facts. Even people who do not ‘own’ large databases face the overload problem when accessing databases on the Internet. A large challenge now facing the database community is how to sift through these databases to find useful information.

Existing database management systems (DBMS) perform the steps of reliably storing data and retrieving the data using a data access language, typically SQL. One major use of database technology is to help individuals and organizations make decisions and generate reports based on the data contained in the database.

An important class of problems in the areas of decision support and reporting are clustering (segmentation) problems where one is interested in finding groupings (clusters) in the data. Data clustering has been used in statistics, pattern recognition, machine learning, and many other fields of science and engineering. However, implementations and applications have historically been limited to small data sets with a small number of dimensions or fields.

Each data cluster includes records that are more similar to members of the same cluster than they are to the rest of the data. For example, in a marketing application, a company may want to decide whom to target for an ad campaign based on historical data about a set of customers and how they responded to previous campaigns. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).

Clustering is a necessary step in the mining of large databases as it represents a means for finding segments of the data that need to be modeled separately. This is an especially important consideration for large databases where a global model of the entire data typically makes no sense as the data represents multiple populations that need to be modeled separately. Random sampling cannot help in deciding what the clusters are. Finally, clustering is an essential step if one needs to perform density estimation over the database (i.e. model the probability distribution governing the data source).

Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining include marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web and live marketing in Electronic Commerce.

Clustering has been formulated in various ways. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other. The most general approach to clustering is to view it as a density estimation problem. We assume that in addition to the observed variables for each data item, there is a hidden, unobserved variable indicating the “cluster membership” of the given data item. Hence the data is assumed to arrive from a mixture model and the mixing labels (cluster identifiers) are hidden. In general, a mixture model M having K clusters C_i, i=1, . . . , K, assigns a probability to a particular data record or point x as follows:

$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M)$$

where $W_i$ are called the mixture weights.
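For concreteness, this mixture probability can be written as a few lines of code. The sketch below is illustrative only; the weight list `W` and the callable `cluster_prob` are hypothetical stand-ins for the per-cluster weights and likelihoods, not names used elsewhere in this disclosure.

# Minimal sketch of Pr(x | M) = sum_i W_i * Pr(x | C_i, M).
# `W` and `cluster_prob` are illustrative stand-ins.

def mixture_probability(x, W, cluster_prob):
    """Return Pr(x | M): the weighted sum of per-cluster likelihoods."""
    return sum(w * cluster_prob(x, i) for i, w in enumerate(W))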

The problem then is estimating the parameters of the individual C_i. Usually it is assumed that the number of clusters K is known and the problem is to find the best parameterization of each cluster model. A popular technique for estimating the model parameters (including cluster parameters and mixture weights) is the EM algorithm (see P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass): Theory and Results”, in Advances in Knowledge Discovery and Data Mining, Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), pp. 153-180, MIT Press, 1996; and A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977).

There are various approaches to solving the optimization problem of determining (locally) optimal values of the parameters given the data. The iterative refinement approaches are the most effective. The basic algorithm goes as follows:

1. Initialize the model parameters, producing a current model.

2. Decide memberships of the data items to clusters, assuming that the current model is correct.

3. Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing a new model.

4. If the current model and the new model are sufficiently close to each other, terminate; else go to 2. In this step ‘close’ is evaluated by a predefined one of multiple possible stopping criteria.
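The four steps above can be sketched as a generic loop. The skeleton below is only a schematic of iterative refinement, assuming caller-supplied functions for the membership step, the re-estimation step, and the stopping test; it is not the scalable procedure disclosed later.

# Schematic of the iterative refinement loop (steps 1-4 above).
# The three callables are assumptions supplied by the caller.

def iterative_refinement(data, model, decide_memberships, reestimate, is_close):
    while True:
        memberships = decide_memberships(data, model)   # step 2
        new_model = reestimate(data, memberships)       # step 3
        if is_close(model, new_model):                  # step 4: stopping criterion
            return new_model
        model = new_model                               # else go back to step 2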

The most popular clustering algorithms in the pattern recognition and statistics literature belong to the above iterative refinement family, the best known being the K-Means algorithm. See E. Forgy, “Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications”, Biometrics 21:768, 1965, or J. MacQueen, “Some methods for classification and analysis of multivariate observations”, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I, Statistics, L. M. Le Cam and J. Neyman (Eds.), University of California Press, 1967.

There are other variants of clustering procedures that iteratively refine a model by rescanning the data many times. The difference between EM and K-Means is the membership decision (step 2). In K-Means, a data item belongs to a single cluster, while in EM each data item is assumed to belong to every cluster but with a different probability. This of course affects the update step (3) of the algorithm. In K-Means each cluster is updated based strictly on its membership. In EM each cluster is updated by contributions from the entire data set according to the relative probability of membership of each data record in the various clusters.
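The contrast between the two membership decisions can be made concrete. A minimal sketch, assuming an r-by-K array of per-cluster likelihood scores (numpy is used for brevity):

import numpy as np

def kmeans_memberships(scores):
    """Hard assignment: each record gets weight 1.0 in its best cluster only."""
    w = np.zeros_like(scores)
    w[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1.0
    return w

def em_memberships(scores):
    """Soft assignment: each record belongs to every cluster with a
    probability proportional to its score in that cluster."""
    return scores / scores.sum(axis=1, keepdims=True)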

SUMMARY OF THE INVENTION

The present invention concerns automated analysis of large databases to extract useful information such as models or predictors from data stored in the database. One of the primary operations in data mining is clustering (also known as database segmentation). One of the most well-known algorithms for probabilistic clustering of a database with both discrete and continuous attributes is the Expectation-Maximization (EM) algorithm applied to a Multinomial/Gaussian mixture.

Discrete data refers to instances wherein the values of a particular field in the database are finite and not ordered. For instance, color is a discrete feature having possible values (green, blue, red, white, black) and it makes no sense to impose an ordering on these values (i.e. green>blue?).

When applied to a Multinomial/Gaussian mixture the EM process models the discrete fields of the database with Multinomial distributions and the continuous fields of the database with a Gaussian distribution. The Multinomial distribution is associated with each attribute and is characterized by a set of probabilities, one probability for each possible value of the corresponding attribute in the database. The Gaussian distribution for continuous data is characterized by a mean and a covariance matrix. The EM process estimates the parameters of these distributions over the database as well as the mixture weights defining the probability model for the database:

$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M).$$

These statistics (set of probabilities for each discrete attribute/value pair and mean and covariance matrix for continuous attributes) provide essential summary statistics of the database and allow for a probabilistic interpretation regarding the membership of a given record in a particular cluster. Given a desired number of clusters K, each cluster is represented by a Gaussian distribution over the continuous database variables and a Multinomial distribution for each discrete attribute (characterized by a probability of observing each value of this discrete attribute). The parameters associated with each of these distributions are estimated by the EM algorithm.

One exemplary embodiment of a scalable clustering algorithm accesses a database of records having attributes or data fields of both enumerated discrete and ordered (continuous) values and brings a portion of the data records into a rapid access memory. Each cluster of database records is represented by a table of probabilities summarizing the enumerated, discrete data fields of the data records in this cluster, and a mean and covariance matrix summarizing the continuous attributes of the records in the cluster. Each entry in the probability table for the discrete attributes represents the probability of observing a specific value of a given discrete attribute in the considered cluster. The mean vector and covariance matrix summarize the distribution of the values of the continuous attributes in the considered cluster. The clusters are updated from the database records brought into the rapid access memory. Sufficient statistics for at least some of the database records in the rapid access memory are summarized. The sufficient statistics (summaries) are made up of a data structure similar to the clusters, i.e. it includes a Gaussian distribution over the continuous record attributes and Multinomial distributions for the discrete attributes. The sufficient statistics are stored within the rapid access memory and the database records that are used to derive these sufficient statistics are removed from rapid access memory. A criterion is then evaluated to determine if further data should be accessed from the database to further cluster data records in the database. Based on this evaluation, additional database records in the database are accessed and brought into the rapid access memory for further updating of the cluster model.

The invention can be used in data mining to: visualize, summarize, navigate, and predict properties of the data/clusters in a database. The parameters allow one to assign data points (database records) to a cluster in a probabilistic fashion (i.e. a point or database record belongs to all K clusters with a computable, and interpretable, probability). Probabilistic clustering also plays an important role in operations such as sampling, indexing, and increasing the efficiency of data access in a database. The invention consists of a new methodology and implementation for scaling the EM algorithm to work with large databases consisting of both discrete and continuous attributes, including ones that cannot be loaded into the main memory of the computer. Without this method, clustering would require significantly more memory, or many unacceptably expensive scans of the data in the database.

This invention enables effective and accurate clustering in one database scan or less. Furthermore, known previous computational work on clustering with the EM algorithm addressed datasets that are either discrete or continuous. Often, if the database contained both discrete and continuous fields, the continuous fields were discretized prior to applying the clustering technique. The present invention avoids removing this natural order from certain fields of the database and explicitly addresses the issue of probabilistically clustering a database with both discrete and continuous attributes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic depiction of a computer system used in practicing an exemplary embodiment of the present invention;

FIGS. 2 and 3 are schematic depictions of software components for performing data clustering in accordance with an exemplary embodiment of the present invention;

FIG. 4 is a flowchart of the processing steps performed by the computer system in clustering data;

FIG. 5 is a depiction of three clusters over a single, continuous attribute showing their relative positions on a one-dimensional scale;

FIGS. 6A-6D are data structures described in the parent application that are used in computing a model summary for data clusters based on data having only continuous attributes;

FIGS. 7A and 7B are flow charts of a preferred clustering procedure for use with data having mixed continuous and discrete data;

FIGS. 8A-8D are data structures used in computing a clustering model from a database having both continuous and discrete attributes; and

FIGS. 9A and 9B are probability tables depicting sufficient statistics for discrete attributes of a database having both discrete and continuous attribute records.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENT OF THE INVENTION

The exemplary embodiment of the invention is implemented by software executing on a general purpose computer 20, a schematic of which is shown in FIG. 1. FIGS. 2 and 3 depict software components that define a data mining engine 12 constructed in accordance with the present invention. The data mining engine 12 clusters data records stored on a database 10. The data records have multiple attributes or fields that contain both discrete and continuous data. Depending on the number of data records and the size of those records, the database 10 is stored on a single fixed disk storage device or alternately can be stored on multiple distributed storage devices accessible to the computer's processing unit 21 over a network. In accordance with the present invention, the data mining engine 12 brings data from the database 10 into a memory 22 (FIG. 1) and outputs a clustering model (FIG. 8D).

The invention has particular utility in clustering data from a database 10 that contains many more records than can be stored in the computer's main memory 22. Data clustering is particularly important in data mining of large databases as it represents a means for finding segments or subpopulations of the data that are similar. For a large database, a global model of the entire database makes little sense since the data represents multiple populations that need to be modeled separately. The present invention concerns a method and apparatus for determining a model for each cluster that includes a set of attribute/value probabilities for the enumerated discrete data fields and a mean and covariance matrix for the ordered data fields.

In a client/server implementation an application program 14 acts as the client and the data mining engine 12 acts as the server. The application program 14 receives an output model (FIG. 8D) and makes use of that model in one of the many possible ways mentioned above, such as marketing studies and fraud detection.

Discrete Data Fields

Consider five database records (Table 1) having three enumerated data fields. These records could be used, for example, by people making marketing evaluations of past trends for use in predicting future behavior. The data records describe purchases of motor vehicles. Only five records are depicted for purposes of explanation but a typical database will have many thousands if not millions of such records. To perform the clustering of the present invention the data records are read from the database 10 into a memory of a computer and the number of records stored in memory at a given time is therefore dependent on the amount of computer memory allocated for the clustering process.

TABLE 1

  Record ID   Color    Style   Sex
  1           yellow   sedan   male
  2           blue     sedan   female
  3           green    sedan   male
  4           white    truck   male
  5           yellow   sport   female

Assume an initial cluster number K=3 is used to cluster the data. The number 3 can be arbitrarily assigned or can be chosen based upon an initial evaluation of the data. In addition, the initial values of the attribute/value probability tables for the discrete attributes in each cluster are initialized by some other process (possibly random initialization). Each record from Table 1 is assigned to each of the three clusters with a different probability or membership. This probability of membership of a data record in one of the clusters is computed based upon the attribute values of the record and the cluster attribute/value probability tables (discrete attributes). Consider the following partially determined cluster probabilities for a clustering evaluation of records like those depicted in Table 1. Suppose Cluster 1 represents 3.0 of 10.0 total data records, Cluster 2 represents 4.5 of the 10.0 data records, and Cluster 3 represents the remaining 2.5 of the 10.0 data records.

Cluster Attribute/Value Probability Tables

Cluster 1: Number of records: 3.0
  color:  R .1    B .2    G .5    W .2
  style:  sedan .5    sport .4    truck .1
  sex:    male .3    female .7

Cluster 2: Number of records: 4.5
  color:  R .3    B .1    G .1    W .5
  style:  sedan .1    sport .6    truck .3
  sex:    male .65    female .35

Cluster 3: Number of records: 2.5
  color:  R .3    B .25    G .25    W .2
  style:  sedan .35    sport .3    truck .35
  sex:    male .45    female .55

As the data mining engine 12 gathers data from the database 10 and analyzes that data, these cluster models change based on the values of the data gathered from the database. Note that the attribute/value probability tables have a row for each discrete attribute and a column for each value of the corresponding attribute, and that the probabilities for the values of a given attribute of a given cluster sum to 1.0.

Consider RecordID #2 from Table 1. The values of the three attributes for this record are ‘blue’, ‘sedan’, and ‘female’, corresponding to an instance of a female purchasing a blue sedan. Based on the above probabilities this record belongs to cluster 1 with a probability proportional to (0.2)(0.5)(0.7)=0.070. This value is determined by finding the value ‘blue’ and noting that for cluster number 1 the probability of this value is 0.2. Similarly, the probabilities for ‘sedan’ and ‘female’ in cluster number 1 are 0.5 and 0.7 respectively. The product of these three probabilities is 0.070.

Now examine the values of this record for clusters 2 and 3.

Cluster 2: (0.1)(0.1)(0.35)=0.0035.

Cluster 3: (0.25)(0.35)(0.55)=0.048125

To take into account the number of data records represented by each cluster, we multiply each cluster value by the fraction of total records belonging to the cluster to obtain the following values:

Cluster 1: (3.0/10.0)*(0.070)=0.021

Cluster 2: (4.5/10.0)*(0.035)=0.01575

Cluster 3: (2.5/10.0)*(0.048125)=0.01203

These values are normalized (adjusted so they sum to 1.0 by dividing each of the 3 values by the sum) giving the probabilities of membership of RecordID#2 in each of the 3 clusters:

Probability of membership of RecordID#2 in Cluster 1: 0.4305

Probability of membership of RecordID#2 in Cluster 2: 0.3228

Probability of membership of RecordID#2 in Cluster 3: 0.2467

For this discussion the values are placed into the augmented record chart of Table 2.

TABLE 2

                                          Cluster Probability
  ID   Color    Style   Sex        #1       #2       #3
  1    yellow   sedan   male
  2    blue     sedan   female     0.4305   0.3228   0.2467
  3    green    sedan   male
  4    white    truck   male
  5    yellow   sport   female

After a given subset of data records has been read from the database, the cluster attribute/value probabilities for the three clusters are updated based upon the cluster membership probabilities for each of the recently gathered data records as well as records evaluated during previous steps of gathering data from the database 10.

The process of updating these probabilities takes into account the number of data records that have been previously gathered from the database as well as the number of new records that were most recently extracted from the database 10. As a simplified example, assume a total of ten records have been classified and have been used to determine the cluster models shown above.

Suppose Cluster No. 1 represents 3.0 of the 10 data points already processed and we wish to update the attribute/value probabilities for the discrete attribute Color in Cluster No. 1 based on the addition of RecordID #2. One has:

                         Number of points            color
                         represented        R        B        G        W
  (ten records)          3.0                .1       .2       .5       .2
  (one record, ID #2)    0.4305             0.0      1.0      0.0      0.0
  (eleven records)       3.4305             0.0875   0.3001   0.4374   0.1750

The entry of 3.0 under “Number of points represented” for the ten records indicates that 3.0 of the 10 records processed thus far are represented by this cluster. Similarly, for cluster #1, we've already calculated that 0.4305 of RecordID#2 is assigned to this cluster, while 0.3228 of it is assigned to cluster 2 and 0.2467 of it is assigned to cluster 3. Hence, after RecordID#2 is processed, we've seen 11 records and 3.0+0.4305=3.4305 of these 11 points are represented by cluster #1.

The formula for computing the updated probability for the red value (R) of the color attribute is: ((3.0)*(0.1)+(0.4305)*(0.0))/(3.0+0.4305)=(0.3)/(3.4305)=0.0875

The formula for computing the update for blue (B) is:

((3.0)*(0.2)+(0.4305)*(1.0))/(3.0+0.4305)=(1.0305)/(3.4305)=0.3001

The formula for computing the update for green (G) is:

((3.0)*(0.5)+(0.4305)*(0.0))/(3.0+0.4305)=(1.5)/(3.4305)=0.4374

The formula for computing the update for white (W) is:

((3.0)*(0.2)+(0.4305)*(0.0))/(3.0+0.4305)=(0.6)/(3.4305)=0.1750

The other attribute probabilities for cluster #1 are updated in a similar fashion:

                         Number of points            style                        sex
                         represented        sedan    sport    truck      male     female
  (ten records)          3.0                .5       .4       .1         .3       .7
  (one record, ID #2)    0.4305             1.0      0.0      0.0        0.0      1.0
  (eleven records)       3.4305             0.5627   0.350    0.0873     0.2624   0.7376

An updated attribute/value probability table for this cluster showing all three discrete attributes and their updated values is shown in FIG. 9B.
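The update rule used in the tables above can be expressed compactly. The sketch below is a minimal illustration of the weighted-average update for one attribute row of one cluster; the function name and argument layout are hypothetical.

def update_value_probabilities(probs, m, observed_index, weight):
    """Fold one record into a cluster's probability row for one attribute.

    probs          -- current probabilities over the attribute's values
    m              -- number of points the cluster represented so far
    observed_index -- index of the value the record actually has
    weight         -- the record's probability of membership in this cluster
    """
    new_m = m + weight
    return [(m * p + (weight if i == observed_index else 0.0)) / new_m
            for i, p in enumerate(probs)]

# Folding RecordID#2 ('blue') into cluster 1's color row with weight 0.4305:
# update_value_probabilities([0.1, 0.2, 0.5, 0.2], 3.0, 1, 0.4305)
# returns approximately [0.087, 0.300, 0.437, 0.175]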

When computing the probability memberships of a given data record in each cluster, we also contemplate not taking into account the fraction of the database represented by a given cluster. In this case the probability of membership of RecordID#2 in each of the 3 clusters would be proportional to the following values, which do not account for the fraction of data records in each cluster:

Cluster 1: (0.2)(0.5)(0.7)=0.070

Cluster 2: (0.1)(0.1)(0.35)=0.0035.

Cluster 3: (0.25)(0.35)(0.55)=0.048125

The resulting probabilities of membership for RecordID#2 based upon the above values are:

Cluster 1: (0.070)/(0.070+0.0035+0.048125)=0.5755

Cluster 2: (0.0035)/(0.070+0.0035+0.048125)=0.0288

Cluster 3: (0.048125)/(0.070+0.0035+0.048125)=0.3957

These values could then be used to update the attribute/value probability tables for each of the clusters in the same fashion.
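A few lines of code reproduce this unweighted variant for RecordID#2; the numbers are the table look-ups given above.

# Unweighted membership of RecordID#2 ('blue', 'sedan', 'female'):
# multiply the attribute/value probabilities per cluster, then normalize.

products = {
    1: 0.2 * 0.5 * 0.7,     # cluster 1: 0.070
    2: 0.1 * 0.1 * 0.35,    # cluster 2: 0.0035
    3: 0.25 * 0.35 * 0.55,  # cluster 3: 0.048125
}
total = sum(products.values())  # 0.121625
memberships = {k: p / total for k, p in products.items()}
# memberships is approximately {1: 0.5755, 2: 0.0288, 3: 0.3957}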

Mixed Data Clustering Model

Now assume that instead of including only data records with discrete non-ordered data, the records read from the database 10 have discrete data like color and ordered (continuous) attributes such as a salary field and an age field. These additional fields are continuous and it makes sense to take the mean and covariance, etc. of their values. For each of the 3 clusters being modeled, one can assign a Gaussian (having a mean and covariance matrix) to the income and age attributes and calculate contributions to each cluster for each data record based upon its attribute values.

Now again consider the records from Table 1. In addition to the previously discussed three attributes of ‘color’, ‘style’ and ‘sex’, each record has the additional attributes of ‘income’ and ‘age’. These mixed attribute records are listed below in Table 3. Note, the female that purchased the blue sedan (RecordID #2) is now further classified with the information that she has an income of 46K and an age of 47 years.

TABLE 3

  Record ID   Color    Style   Sex      Income   Age
  1           yellow   sedan   male     24K      32 yrs
  2           blue     sedan   female   46K      47
  3           green    sedan   male     82K      66
  4           white    truck   male     40K      30
  5           yellow   sport   female   38K      39

For each of the records of Table 3 the data mining engine 12 must compute the probability of membership of each data record in each of the three clusters. Suppose, in the general case, the discrete attributes are labeled “DiscAtt#1”, “DiscAtt#2”, . . . , “DiscAtt#d” and let the remaining continuous attributes make up a numerical vector x. The notation for determining this probability is:

Prob(record|cluster #)=p(DiscAtt#1|cluster #)*p(DiscAtt#2|cluster #)* . . . *p(DiscAtt#d|cluster #)*p(x|μ,Σ of cluster #). Here p(DiscAtt#j|cluster #) is computed by looking up the stored probability of DiscAtt#j in the given cluster (i.e. reading the current probability from the attribute/value probability table associated with this cluster). p(x|μ,Σ of cluster #) is calculated by computing the value of x under a normal distribution with mean μ and covariance matrix Σ:

$$p(x \mid \mu, \Sigma \text{ of cluster } \#) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}$$

When performing an expectation-maximization (EM) clustering analysis such as the analysis described in the Fayyad et al. parent patent application Ser. No. 09/083,906, each data point is assigned to each of the K clusters (K=3 in the above example) with a probability or weighting factor. For example, if all the attributes of data records in the database are ordered, then each cluster has a mean and a covariance matrix of a dimension equal to the number of attributes in the data record. For a data record having n ordered dimensions, the resulting covariance matrix is an n×n matrix.

If the EM analysis is used in conjunction with an exemplary embodiment of the present invention, one associates a Gaussian distribution of data about the centroid of each of the K clusters for the ordered dimensions. For each of the data records (having mixed discrete and ordered attributes) a weighting factor is similarly determined indicating the degree of membership of this data record in a given cluster. In our example with 3 clusters, the weightings are determined by:

Weight in cluster 1 = P(record|cluster 1)/[P(record|cluster 1)+P(record|cluster 2)+P(record|cluster 3)]

Weight in cluster 2 = P(record|cluster 2)/[P(record|cluster 1)+P(record|cluster 2)+P(record|cluster 3)]

Weight in cluster 3 = P(record|cluster 3)/[P(record|cluster 1)+P(record|cluster 2)+P(record|cluster 3)]

Here P(record|cluster #) is given as above.
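In code, the weight computation combines the discrete table look-ups with the Gaussian density of the continuous part. A minimal sketch, assuming the per-cluster products of discrete probabilities, the Gaussian heights (the h1, h2, h3 discussed below), and the data fractions have already been computed:

def membership_weights(disc_probs, heights, fractions):
    """Normalized membership of one mixed record in each of K clusters.

    disc_probs[k] -- product of table look-ups p(DiscAtt#j | cluster k)
    heights[k]    -- Gaussian density p(x | mu, Sigma of cluster k)
    fractions[k]  -- fraction of the data represented by cluster k
    """
    scores = [f * d * h for f, d, h in zip(fractions, disc_probs, heights)]
    total = sum(scores)
    return [s / total for s in scores]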

FIG. 5 depicts three Gaussian distribution curves. One dimension is plotted for simplicity, but note that the height of a given Gaussian curve above a point is p([income, age]|cluster #). The Gaussian data distributions G1, G2, G3 summarize data clusters having centroids or means x̄¹, x̄², x̄³ and represent the distributions over the continuous attributes in the 3 clusters in our example. The compactness of the data is generally indicated by the shape of the Gaussian and quantified by the corresponding covariance value, and the average value of the cluster is given by the mean.

Now consider the data point identified along the axis as the point “X” having an annual income=$46,000 from RecordID#2. RecordID#2 ‘belongs’ to all three of the clusters described by the three Gaussians G1, G2, G3 and the attribute/value probability table associated with each cluster for the discrete attributes. Note that the height of this data point under Gaussian G3 is negligible, but call this height h3. Suppose the height of this data point under Gaussian G2 is h2 and the height of this data point under Gaussian G1 is h1. Then the probability that RecordID#2 is in cluster 1 is determined by the height h1, the values of the attribute/value probability table for cluster 1, and the number of data points represented by cluster 1. The value is computed as

(fraction of data points represented by cluster 1)*P(RecordID#2|cluster 1) = (fraction of data points represented by cluster 1)*P(DiscAtt#1=blue|cluster 1)*P(DiscAtt#2=sedan|cluster 1)*P(DiscAtt#3=female|cluster 1)*p([46K, 47 yrs]|μ,Σ of cluster 1) = (3.0/10.0)*(0.2)*(0.5)*(0.7)*h1.

Similarly, the probability that this record is in cluster 2 is (fraction of data points represented by cluster 2)*P(RecordID#2|cluster 2) = (4.5/10.0)*(0.1)*(0.1)*(0.35)*h2, and the probability that this record is in cluster 3 is (fraction of data points represented by cluster 3)*P(RecordID#2|cluster 3) = (2.5/10.0)*(0.25)*(0.35)*(0.55)*h3. Then the weight of this data point in cluster 1 is

Weight1 = [(3.0/10.0)*P(RecordID#2|cluster 1)]/[(3.0/10.0)*P(RecordID#2|cluster 1)+(4.5/10.0)*P(RecordID#2|cluster 2)+(2.5/10.0)*P(RecordID#2|cluster 3)].

Similarly, the weight of this data point in cluster 2 is

Weight2 = [(4.5/10.0)*P(RecordID#2|cluster 2)]/[(3.0/10.0)*P(RecordID#2|cluster 1)+(4.5/10.0)*P(RecordID#2|cluster 2)+(2.5/10.0)*P(RecordID#2|cluster 3)],

and the weight of this data point in cluster 3 is

Weight3 = [(2.5/10.0)*P(RecordID#2|cluster 3)]/[(3.0/10.0)*P(RecordID#2|cluster 1)+(4.5/10.0)*P(RecordID#2|cluster 2)+(2.5/10.0)*P(RecordID#2|cluster 3)].

The weights Weight1, Weight2 and Weight3 indicate the “degree of membership” that RecordID#2 has in each of the 3 clusters. Knowing these weights, the probability tables are updated as described above and the values of μ and Σ are updated in the cluster model.

Overview of Scalable Clustering

FIG. 4 is a flow chart of the process steps performed during a scalable clustering analysis of data in accordance with the present invention. As described in the parent application to Fayyad et al., when only continuous data attributes are used for clustering, during an initialization step 100 the data structures shown in FIGS. 6A-6D are initialized. When mixed discrete and continuous data attributes are contained in the data records, the FIG. 6 data structures are augmented with probability tables P (one for each cluster) as seen in FIGS. 8A-8D.

Clustering is initiated by obtaining 110 a sample data portion from the database 10 and bringing that data portion into a random access memory (into RAM for example, although other forms of random access memory are contemplated) of the computer 20 shown in FIG. 1. A data structure 180 for data having both discrete and continuous fields is shown in FIG. 8C to include a number r of records having a number of attributes D=n+d where there are n continuous attributes and d discrete attributes.

The gathering of data can be performed using either a sequential scan that uses only a forward pointer to sequentially traverse the data or an indexed scan that provides a random sampling of data from the database. When using the indexed scan it is a requirement that data not be accessed multiple times. This can be accomplished by marking data tuples to avoid duplicates, or by a random index generator that does not repeat. In particular, it is most preferable that a first iteration of sampling data be done randomly. If it is known that the data is random within the database then sequential scanning is acceptable. If it is not known that the data is randomly distributed, then random sampling is needed to avoid an inaccurate representation of the database.

A processor unit 21 of the computer 20 next performs a clustering procedure 120 using the data brought into memory in the step 110 as well as compressed data in two data structures CS, DS (FIGS. 8A, 8B). In accordance with an exemplary clustering process the processor unit 21 assigns data contained within the portion of data brought into memory to a cluster for purposes of recalculating the cluster probabilities for the discrete data attributes and the Gaussian mean and covariance matrix for the continuous data attributes.

A data structure for the results or output model of the analysis for the ordered attributes is depicted in FIG. 8D. This model includes K data structures, one for each cluster. Each cluster is defined by 1) a vector ‘Sum’ representing the sum of each of the database records for each of the ordered or continuous attributes or dimensions (n=number of continuous attributes), 2) a vector ‘Sumsq’ representing the sum of the continuous attributes squared, 3) a floating point value ‘M’ counting the number of data records contained in or belonging to the corresponding cluster, and 4) an attribute/value probability table such as the table depicted in FIG. 9A, summarizing the discrete attributes (d=number of discrete attributes).

The parameters represented in the data structures (FIG. 8D) enable the data mining engine to assign a probability of cluster membership to every data record read from memory. The scalable clustering process needs this probability to determine data record membership in the DS, CS, and RS data sets (discussed below), as part of a data compression step 130.

A data compression step 130 in the FIG. 4 flowchart summarizes at least some of the data gathered in the present iteration. This summarization is contained in the data structures DS, CS of FIGS. 8A and 8B. If the loop iteration in the FIG. 4 process does not produce a satisfactory model (as described below) these RS, DS, and CS data structures are used to update the model of FIG. 8D during a next loop iteration. The summarization into the DS and CS structures takes significantly less storage in a computer memory 25 than the vector data structure (FIG. 8C) needed to store individual records. Storing a summarization of the data in the data structures of FIGS. 8A and 8B frees up more memory allowing additional data to be sampled from the database 10.

Before looping back to get more data at the step 110, the processor 21 determines 140 whether a stopping criterion has been reached. One stopping criterion that is used is whether the analysis has produced a sufficient model (FIG. 8D) by a standard that is described below. A second stopping criterion has been reached if all the data in the database 10 has been used in the analysis.

One feature of the invention is the fact that instead of stopping the analysis, the analysis can be suspended. Data in the data structures of FIGS. 8A-8D can be saved (either in memory or to disk) and the scalable clustering analysis can then be resumed later. This allows the database 10 to be updated and the analysis resumed to update the clustering statistics without starting from the beginning. It also allows another process to take control of the processor 21 without losing the state of the clustering analysis. The suspension could also be initiated in response to a user request that the analysis be suspended by means of a user actuated control on an interface presented to the user on a monitor 47 while the clustering analysis is being performed.

Perturbation Data Compression

The present data clustering process is particularly useful for clustering large databases. The process frees up memory so that more data from the database can be accessed. This is accomplished by compressing data and storing sufficient statistics for compressed data in the memory, thereby freeing up memory for the gathering of more data from the database. For each of the K clusters a confidence interval on the Gaussian mean is defined for each of the continuous attributes and a confidence interval is defined for each value in the attribute/value probability table for the discrete attributes. Appendix A describes one process for setting up a confidence interval on the multidimensional Gaussian means associated with the continuous attributes of the K clusters.

Consider the example of five attributes of color, style, sex, income and age from Table 3. For the discrete attributes such as color, the model (FIG. 8D) includes probabilities for each attribute value (see FIG. 9A) in the range of between 0.0 and 1.0. When determining which of the data records can safely be compressed, the data mining engine 12 sets up a confidence interval that brackets these probabilities.

For color:
  Red           Blue          Green        White
  .1 +/− .005   .2 +/− .005   .5 +/− .01   .2 +/− .008

For style:
  sedan         sport         truck
  .5 +/− .007   .4 +/− .003   .1 +/− .002

For sex:
  Male          Female
  .3 +/− .05    .7 +/− .05

Confidence intervals are also set up for the continuous attributes for each of the clusters. Assume that for cluster #1 the mean income attribute is $40,000 and the confidence interval is $1500 above and below this value. The age attribute confidence interval for cluster #1 is 45 yrs +/− 2.

Now consider the second data record. As calculated above, this data record was assigned to cluster #1 with highest probability of membership. The perturbation technique determines whether to compress a record into the DS data structure (FIG. 8A) by adjusting the probabilities of the cluster to which the record is assigned so that the probability of membership in this “adjusted” cluster is decreased (it lowers the attribute/value probabilities within the confidence interval for the discrete attributes and shifts the cluster mean away from the data record for the continuous attributes), and by adjusting the probabilities and means of the clusters to which the data record is not assigned so that the probability of membership in these “adjusted” clusters is increased (it raises the attribute/value probabilities and shifts the mean toward the data record for the continuous attributes). This process maximizes the possibility that RecordID #2 will be assigned to a different cluster with highest probability of membership.

With these temporary adjustments, the calculations for the data record membership are again performed. If the data record (RecordID #2) does not change cluster membership (the maximum probability of cluster membership is still in the original cluster) the sufficient statistics for this data record can be safely added to the DS data structure in FIG. 8A. The adjusted attribute/value probabilities and cluster means are then returned to their original state.
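In outline, the perturbation test looks like the sketch below. The two helper callables are assumptions standing in for the confidence-interval adjustment and the membership computation described above, not names from this disclosure.

def can_compress(record, model, assigned, perturb, best_cluster):
    """Perturbation test: temporarily worsen the assigned cluster and favor
    the others (within their confidence intervals); the record is safe to
    compress into DS only if its highest-probability cluster is unchanged.

    perturb(model, assigned)    -- returns an adjusted copy of the model
    best_cluster(record, model) -- index of the highest-probability cluster
    """
    adjusted = perturb(model, assigned)   # original model is left untouched
    return best_cluster(record, adjusted) == assigned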

Assume RecordID #2 is compressed at this stage. The record is removed from the RS list of records, and its attribute values are used to form the sufficient statistics contained in the set DS associated with the cluster with highest probability of membership of RecordID#2. The DS data structure consists of Gaussian sufficient statistics (Sum, Sumsq, M) summarizing the values of continuous attributes and an attribute/value probability table summarizing the values of the discrete attributes (see FIG. 8A). The processing step 130 visits each record, attempts to compress that record, and if the record can be compressed the vectors of SUM, SUMSQ, and M and the attribute/value probability tables P are all updated. The tables P associated with the DS and CS data structures then contain sufficient statistics of discrete attributes for compressed records that are removed from memory.

Thresholding Data Compression

A second data compression process is called thresholding. One can sort all the data points falling within a given cluster based on the probability assigned to them, i.e. (fraction of data points represented by cluster #1)*p(discrete_r1|cluster #1)*p(discrete_r2|cluster #1)* . . . *p(discrete_rd|cluster #1)*p(continuous_r|μ,Σ of cluster #1), and choose for compression into the DS dataset the data points having the highest probability of membership. An additional alternate threshold process would be to take all the data points assigned to a cluster and compress into DS all the data points where the product of the probabilities is greater than a threshold value.
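A sketch of the second, simpler variant (compress every record whose membership probability exceeds a fixed threshold); the `membership` callable is an assumed stand-in for the product of probabilities described above:

def threshold_compress(records, membership, threshold):
    """Split a cluster's records into those compressed into DS and those
    kept in RS, based on a fixed probability threshold."""
    compress = [r for r in records if membership(r) > threshold]
    keep = [r for r in records if membership(r) <= threshold]
    return compress, keep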

Subclustering

The subclustering is done after all possible data records have been compressed into the DS data structure. The remaining candidates for summarization into the CS data structures (FIG. 8B) are first filtered to see if they are sufficiently “close” to an existing CS subcluster or, equivalently, whether their probability of membership in an existing CS subcluster is sufficiently high. If not, a clustering is performed using random starting conditions. Subclusters lacking a requisite number of data points are put back in RS and the remaining subclusters are merged.

Assume that the set RS (FIG. 8C) consists of singleton data points: the compressed points have been removed from the RS dataset and have been summarized in the DS data set, contributing to the values of Sum, Sumsq, M, and the attribute/value probability table in the DS structure. The subclustering then proceeds as follows:

1. Let m be the number of singleton data elements left in RS.

2. Set CS_New = empty.

3. Set k′ to be the number of subcluster candidates to search for.

4. Randomly choose k′ elements from RS to use as an initial starting point for a classic EM clustering. Run classic EM with harsh assignments over the data remaining in RS with this initial point. Harsh assignments in classic EM can be accomplished by assigning a data record with weight 1.0 to the subcluster with highest probability of membership and not assigning it to any other subcluster. This procedure will determine k′ candidate subclusters.

5. Set up a new data structure CS_New to contain the set of sufficient statistics, including attribute/value probability tables for the discrete attributes, of the records associated with the k′ candidate subclusters determined in this manner.

6. For each set of sufficient statistics in CS_New, if the number of data points represented by these sufficient statistics is below a given threshold, remove the set of sufficient statistics from CS_New and leave the data points generating these sufficient statistics in RS.

7. For each set of sufficient statistics remaining in CS_New, if the maximum standard deviation along any continuous dimension of the corresponding candidate subcluster is greater than a threshold β, or the maximum standard deviation of an entry in the attribute/value probability table is greater than β/2 (β in the range [0,1]), remove the set of sufficient statistics from CS_New and keep the data points generating these sufficient statistics in RS. The value of β/2 is derived as follows: the standard deviation of a probability p is sqrt(p*(1.0−p)). This value is maximized when p=0.5, in which case sqrt(p*(1.0−p))=sqrt(0.25)=0.5. Hence, in the worst case, the standard deviation is 0.5. Since β takes values in [0,1], we threshold the standard deviation of the probability by β/2.

8. Set CS_Temp = CS_New ∪ CS. This augments the set of previously computed sufficient statistics CS with the new ones surviving the filtering in steps 6 and 7.

9. For each set of sufficient statistics s (corresponding to a sub-cluster) in CS_Temp, determine s′, the set of sufficient statistics in CS_Temp with highest probability of membership in the subcluster represented by s.

10. If the subcluster formed by merging s and s′, denoted by merge(s, s′), is such that the maximum standard deviation along any continuous dimension is less than β and the maximum standard deviation of an entry in the attribute/value probability table is less than β/2 (β in the range [0,1]), then add merge(s, s′) to CS_Temp and remove s and s′ from CS_Temp.

11. Set CS = CS_Temp. Remove from RS all points that went into CS (RS = RS − CS). Note that the vectors Sum, Sumsq, the values of M, and the attribute/value probability tables for the newly-found CS elements were determined in the sub-clustering process or in the merge processes. Note that the function merge(s, s′) simply computes the sufficient statistics for the sub-cluster summarizing the points in both s and s′ (i.e. computes Sum, Sumsq, M, and the attribute/value probabilities for the sub-cluster consisting of the points in s and s′).
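The merge(s, s′) operation referenced in steps 10 and 11 can be sketched directly from the sufficient statistics: the Sum and Sumsq vectors and the counts add, while the probability tables combine as an M-weighted average (mirroring the CS update rule given later). The dictionary layout below is an illustrative assumption, not the disclosed data structure.

def merge(s1, s2):
    """Sufficient statistics for the subcluster holding the points of both
    s1 and s2. Each argument is a dict with keys 'SUM', 'SUMSQ', 'M' and
    'P' (attribute name -> list of value probabilities)."""
    m = s1["M"] + s2["M"]
    return {
        "SUM": [a + b for a, b in zip(s1["SUM"], s2["SUM"])],
        "SUMSQ": [a + b for a, b in zip(s1["SUMSQ"], s2["SUMSQ"])],
        "M": m,
        # Probability tables combine as a weighted average by point counts.
        "P": {att: [(s1["M"] * p1 + s2["M"] * p2) / m
                    for p1, p2 in zip(s1["P"][att], s2["P"][att])]
              for att in s1["P"]},
    }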

Data Structures

Data structures used during performance of the clustering evaluation are found in FIGS. 8A-8D. An output or result of the clustering analysis is a data structure designated MODEL which includes an array 152 of pointers to a first vector 154 of n elements (floats) designated ‘SUM’, a second vector 156 of n elements (floats) designated ‘SUMSQ’, a single floating point number 158 designated ‘M’, and an attribute/value probability table P (entries are floats) such as the table of FIG. 9A. The number M represents the number of database records represented by a given cluster. The model includes K entries, one for each cluster.

The vector ‘SUM’ represents the sum of the weighted contribution of each of the n continuous database record attributes that have been read in from the database. As an example, a typical record will have a value of the ith dimension which contributes to each of the K clusters. Therefore the ith dimension of that record contributes a weighted component to each of the K SUM vectors. A second vector ‘SUMSQ’ is the sum of the squared components of each record, which allows straightforward computation of the diagonal elements of the covariance matrix. In the general case the SUMSQ could be a full n×n matrix, allowing the computation of a full n×n covariance matrix. It is assumed for the disclosed exemplary embodiment that the off-diagonal elements are zero. A third component of the model is a floating point number ‘M’. The number ‘M’ is determined by totaling the probability of membership for a given cluster over all data points. These structures (vectors SUM, SUMSQ, value M and the attribute/value probability table) constitute the model output from the EM process for a given cluster K.
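As a rough illustration, one per-cluster entry of the MODEL structure might be declared as below; the field names follow the text, while the container types are an assumption made for the sketch.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ClusterEntry:
    """One of the K per-cluster entries of the MODEL structure (FIG. 8D)."""
    SUM: List[float]           # length n: weighted sums of continuous attributes
    SUMSQ: List[float]         # length n: sums of squares (diagonal covariance case)
    M: float                   # number of records attributed to this cluster
    P: Dict[str, List[float]]  # attribute -> probabilities over its values

Model = List[ClusterEntry]     # the MODEL: K entries, one per cluster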

An additional data structure designated DS in FIG. 8A includes an array of pointers 160 that point to a group of K vectors (the cluster number) of the n continuous attribute elements 162 designated ‘SUM’, a second group of K vectors 164 designated ‘SUMSQ’, a group 166 of K floats designated M, and an attribute/value probability table P such as the table shown in FIG. 9A. This data structure is similar to the data structure of FIG. 8D that describes the MODEL. It contains sufficient statistics of the continuous and discrete attribute values for a number of data records that have been compressed into the FIG. 8A data structure shown rather than maintained as individual records (FIG. 8C) in memory. Compression of the data into this data structure and the CS data structure described below frees up memory for accessing other data from the database at the step 110 on a next subsequent iteration of the FIG. 4 clustering process.

A further data structure designated CS in FIG. 8B is an array of c pointers where each pointer points to an element which consists of a vector of n elements (floats) designated ‘SUM’, a vector of n elements (floats) designated ‘SUMSQ’, and a scalar ‘M’. Again, an attribute/value probability table summarizes the discrete attributes of the points compressed into CS elements. The data structure CS also summarizes multiple data points into structures similar to the MODEL data structure and represents a subcluster of data records.

As noted previously, the data structure designated RS (FIG. 8C) is a group of r vectors having D=d+n dimensions that include both continuous attributes (n continuous attributes) and discrete attributes (d discrete attributes). As data is read in from the database at the step 110 it is initially stored in the set RS and then used to update the cluster model. It is the updated model of FIGS. 8A-8D that is used in determining which records in RS should be summarized in DS and CS.

Extended Clustering Procedure of FIGS. 7A and 7B

The extended clustering procedure 120 (FIGS. 7A and 7B) takes the contents of the three data structures RS, DS, CS, stored in the data structures of FIGS. 8A, 8B, 8C, and produces a new model. The new model (including the updated cluster attribute/value probability tables P) is then stored in place of the old model (FIG. 8D).

The data structures of FIGS. 8A-8D are initialized 100 (FIG. 4) before any data is read from the database 10. In order for the clustering procedure 120 to process the first set of data read into the memory, the MODEL data structure of FIG. 8D that is copied into the Old_Model data structure must therefore not be null. An initial set of K cluster means or centroids is chosen, and one procedure for this initialization is to randomly choose the means and place them in the vector ‘Sum’ while setting M=1.0. Arbitrary values are chosen as diagonal members (SUMSQ) of the starting model's K covariance matrices. The diagonal values of the starting matrices are chosen to range in size from 0.8 to 1.2 for data in which the continuous attributes have been normalized into a range [−5,5]. An initial attribute/value probability table for each cluster is also arbitrarily assigned on the first iteration. One approach may be to set all attribute values equally likely.

The clustering procedure 120 starts by copying the existing model to create 202 an Old_Model in a data structure like that of FIG. 8D. The process next determines 204 the length of the pointer arrays of FIGS. 8A-8C and computes the total number of data records summarized by the Old_Model, and computes 206 means and covariance matrices from the Old_Model SUM, SUMSQ and M data for the continuous data. The set of Old_Model means and covariance matrices that are derived from this calculation are stored as a list of length K where each element of the list includes two parts:

1) a vector of length n (called the “mean”) which stores the mean of the corresponding Gaussian or cluster;

2) a matrix of size n×n (called the “CVMatrix”) which stores the values of a covariance matrix of the corresponding Gaussian or cluster.

The means and covariance matrices are referred to below as “Old_SuffStats”.

To compute the matrix CVMatrix for a given cluster from the sufficient statistics SUM, SUMSQ and M (in FIG. 8D), the clustering procedure computes an outer product defined for 2 vectors, OUTERPROD(vector1, vector2). The OUTERPROD operation takes 2 vectors of length n and returns their outer product, or the n×n matrix with the entry in row h and column j being vector1(h)*vector2(j). A DETERMINANT function computes the determinant of a matrix. The step 206 also uses a function INVERSE that computes the inverse of a matrix. A function TRANSPOSE returns the transpose of a vector (i.e. changes a column vector to a row vector). A function EXP(z) computes the exponential e^(z).

A function ‘ConvertSuffStats’ calculates 206 the mean and covariance matrix from the sufficient statistics stored in a cluster model (SUM, SUMSQ, M) for the continuous attributes:

[Mean, CVMatrix] = ConvertSuffStats(SUM, SUMSQ, M)
  Mean = (1/M)*SUM;
  MSq = M*M;
  OutProd = OUTERPROD(SUM, SUM);
  CVMatrix = (1/MSq)*(M*SUMSQ − OutProd);

A function designated ‘GAUSSIAN’, defined below, is used at a step 212 (FIG. 7A) to compute the height of the Gaussian curve above a given data point, where the Gaussian has mean=Mean and covariance matrix=CVMatrix.

[height] = GAUSSIAN(x, Mean, CVMatrix)
  normalizing_constant = (2*PI)^(n/2)*SQRT(DET(CVMatrix));
  CVMatrixInv = INVERSE(CVMatrix);
  height = (1/normalizing_constant)*EXP(−(1/2)*(TRANSPOSE(x−Mean))*CVMatrixInv*(x−Mean));

Note that, mathematically, the value of GAUSSIAN for a given cluster at the data point x is:

$$height = \frac{1}{(2\pi)^{n/2}\,|CVMatrix|^{1/2}} \cdot \exp\left(-\frac{1}{2}(x - Mean)^{T}(CVMatrix)^{-1}(x - Mean)\right)$$
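For reference, a direct transcription of GAUSSIAN into Python (numpy assumed); this is a plain implementation of the formula above, not optimized code:

import numpy as np

def gaussian(x, mean, cv_matrix):
    """Height of the Gaussian with the given Mean and CVMatrix at point x."""
    x, mean = np.asarray(x, float), np.asarray(mean, float)
    n = mean.shape[0]
    normalizing_constant = ((2.0 * np.pi) ** (n / 2.0)
                            * np.sqrt(np.linalg.det(cv_matrix)))
    diff = x - mean
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cv_matrix) @ diff)
                 / normalizing_constant)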

After resetting 208 a New_Model data structure (similar to FIG. 8D) to all zeros, each point of the RS data structure of data records is accessed 210 and used to update the values of Sum, Sumsq, M (summarizing continuous data attributes) and the attribute/value probability tables derived from the discrete attributes that make up the data structures of the New_Model. A contribution to each of the K clusters is determined for each of the data points in RS by determining the weight (equivalent to the probability of membership in each cluster) of each point under the old model. A weight vector has K elements weight(1), weight(2), . . . , weight(K) where each element indicates the normalized or fractional assignment of the data point to the corresponding cluster. Recall that each data record contributes to all of the K clusters that were set up during initialization. The mean and covariance matrix structures allow a height contribution for each record (for j=1, . . . , r) to be determined at step 212 of the extended EM procedure 120 for each cluster (for l=1, . . . , K). The attribute/value probability tables provide the contribution to the weight (probability of membership) calculation over the discrete attributes. This height contribution and the contribution from the attribute/value probability table are then scaled to form a weight contribution that takes into account the fraction of data points assigned to cluster l: M_l/M_Total, where M_Total is the total number of data records read thus far from the database.

Normalizing the weight factor is performed at a step 214 of the procedure. At a step 216 an outer product is calculated for the relevant vector data point in RS. An update step 218 loops through all K clusters to update the new model data structure by adding the contribution for each data point:

for j = 1, . . ., K
  New_Model(j).SUM = New_Model(j).SUM + Weight(j)*center;
  New_Model(j).SUMSQ = New_Model(j).SUMSQ + Weight(j)*OuterProd;
  New_Model(j).Attribute_Value_Table(row i, column h) =
    [(New_Model(j).M)*New_Model(j).Attribute_Value_Table(row i, column h)
     + Weight(j)]/[New_Model(j).M + Weight(j)]
    if the data point under consideration has value h for discrete attribute i. Otherwise
  New_Model(j).Attribute_Value_Table(row i, column h) =
    [(New_Model(j).M)*New_Model(j).Attribute_Value_Table(row i, column h)]/[New_Model(j).M + Weight(j)].
  New_Model(j).M = New_Model(j).M + Weight(j);
End for

The probability table of the New_Model is updated as described above, one record at a time.

The process of updating the data structures of the New_Model continues for all points in the RS data structure. On the first pass through the procedure, the data structures DS and CS are null and the RS structure is made up of data read from the database. Typically a portion of the main memory of the computer 20 (FIG. 1) is allocated for the storage of the data records in the RS structure. On a later iteration of the processing loop of FIG. 4, however, the data structures DS, CS are not null. To free up space for a next iteration of data gathering from the database, some of the data in the structure RS is summarized and stored in one of the two data structures CS or DS (FIGS. 8A, 8B).

After each of the single points in the RS structure has been used to update the model, a branch 220 is taken to begin the process of updating the New_Model data structure for all the subclusters in CS. A contribution is determined for each subcluster in CS (denoted as CS_Elem) by determining 230 the weight of each subcluster under the old model (determining the probability of membership of the CS sub-cluster in a given model cluster). First a center vector for the subcluster for the continuous attributes is determined 230 from the relation center=(1/CS_Elem.M)*CS_Elem.SUM. The probability of membership of the CS element in a given cluster, say cluster #1, is then computed as

(fraction of data points in cluster #1)*[P(DiscAtt#1=Val11|cluster #1)^(CS_Elem.M*CS_Elem.Attribute_Value_Table(1,1))*P(DiscAtt#1=Val12|cluster #1)^(CS_Elem.M*CS_Elem.Attribute_Value_Table(1,2))* . . . *P(DiscAtt#1=Val1V(1)|cluster #1)^(CS_Elem.M*CS_Elem.Attribute_Value_Table(1,V(1)))*P(DiscAtt#2=Val21|cluster #1)^(CS_Elem.M*CS_Elem.Attribute_Value_Table(2,1))* . . . *P(DiscAtt#d=ValdV(d)|cluster #1)^(CS_Elem.M*CS_Elem.Attribute_Value_Table(d,V(d)))]*Gaussian(center, Mean of cluster 1, CVMatrix of cluster 1).

In the above, we assume that discrete attribute i has V(i) possible values.

A weight vector has K elements weight(1), weight(2), . . . , weight(K), where each element indicates the normalized or fractional assignment of a given subcluster to a cluster. This weight is determined 232 for each cluster and the weight factor is normalized at a step 234 of the procedure. An update step 238 for the subcluster of CS loops through all K clusters:

for j = 1, . . ., K
  New_Model(j).SUM = New_Model(j).SUM + Weight(j)*CS_Elem.SUM;
  New_Model(j).SUMSQ = New_Model(j).SUMSQ + Weight(j)*CS_Elem.SUMSQ;
  New_Model(j).Attribute_Value_Table(row i, column h) =
    [(New_Model(j).M)*New_Model(j).Attribute_Value_Table(row i, column h)
    + Weight(j)*CS_Elem.M*CS_Elem.Attribute_Value_Table(row i, column h)]
    / [New_Model(j).M + Weight(j)*CS_Elem.M],
    for all rows i and all columns h in the Attribute_Value_Table.
  New_Model(j).M = New_Model(j).M + Weight(j)*CS_Elem.M;
End for

The probability table CS_Elem.Attribute_Value_Table is combined with the updated probability table for the new model, New_Model(j).Attribute_Value_Table, based on the number of records summarized by the subcluster and the values in the subcluster attribute/value probability table.
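
The combination can be read as a weighted average of the two tables, weighted by the effective number of records each side summarizes. A minimal sketch, again with assumed field names:

    def merge_tables(new_c, cs_elem, w):
        # blend the subcluster's attribute/value table into the new model's
        # table; w is the subcluster's normalized weight for this cluster
        denom = new_c.M + w * cs_elem.M
        new_c.table = (new_c.M * new_c.table
                       + w * cs_elem.M * cs_elem.table) / denom
        new_c.M = denom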

When the contributions of all the subclusters whose sufficient statistics are contained in CS have been used to update the New_Model, a branch 240 is taken to update the New_Model using the contents of the data structure DS. A center for each of the K entries of DS is determined 250 from the relation center = (1/DS_Elem.M)*DS_Elem.SUM. A weight of this DS structure is then determined under the Old_Model in exactly the same fashion as the weights were determined for the CS structures above, and the weight is normalized 254. The contribution of each of these entries is then added 260 to the sufficient statistics of the New_Model:

for j = 1, . . ., K
  New_Model(j).SUM = New_Model(j).SUM + Weight(j)*DS_Elem.SUM;
  New_Model(j).SUMSQ = New_Model(j).SUMSQ + Weight(j)*DS_Elem.SUMSQ;
  New_Model(j).Attribute_Value_Table(row i, column h) =
    [(New_Model(j).M)*New_Model(j).Attribute_Value_Table(row i, column h)
    + Weight(j)*DS_Elem.M*DS_Elem.Attribute_Value_Table(row i, column h)]
    / [New_Model(j).M + Weight(j)*DS_Elem.M],
    for all rows i and all columns h in the Attribute_Value_Table.
  New_Model(j).M = New_Model(j).M + Weight(j)*DS_Elem.M;
End for

After the New_Model has been updated at the step 260 for each of the K clusters, the extended EM procedure tests 265 whether a stopping criterion has been met. This test begins with an initialization of three variables: CV_dist = 0, mean_dist = 0, Ptable_dist = 0. For each cluster a new covariance matrix is calculated, and a distance between the old mean and the new mean as well as a distance between the new and old covariance matrices is determined for the continuous attributes. Similarly, a distance between probability tables is calculated. These values are totaled for all the clusters:

For j = 1, . . ., K
  [New_Mean, New_CVMatrix] = ConvertSuffStats(New_Model(j).SUM,
    New_Model(j).SUMSQ, New_Model(j).M);
  mean_dist = mean_dist + distance(Old_SuffStats(j).Mean, New_Mean);
  CV_dist = CV_dist + distance(Old_SuffStats(j).CVMatrix, New_CVMatrix);
  Ptable_dist = Ptable_dist + distance(Old_Model(j).Attribute_Value_Table,
    New_Model(j).Attribute_Value_Table);
End for

The distance between attribute/value probability tables may be computed by summing the absolute values of the differences of the table entries and dividing by the total number of entries in the attribute/value table.
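
In code this is a mean absolute difference over the table entries, for example:

    import numpy as np

    def table_distance(old_table, new_table):
        # sum of absolute entry differences divided by the number of entries
        return np.abs(old_table - new_table).mean()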

The stopping criterion determines whether the sum of the mean and covariance distances for the continuous attributes and the distance between the probability tables summarizing the discrete attributes, averaged over the clusters, is less than a stopping tolerance:

[(1/(3*K))*(mean_dist + CV_dist + Ptable_dist)] < stop_tol
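
With the three accumulated distances in hand, the test reduces to a one-line comparison (K and stop_tol as defined above):

    def stopping_test(mean_dist, cv_dist, ptable_dist, K, stop_tol):
        return (mean_dist + cv_dist + ptable_dist) / (3 * K) < stop_tol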

If the stopping criterion is met, the New_Model becomes the Model and the procedure returns 268. Otherwise the New_Model becomes the old model and the procedure branches 270 back to recalculate another New_Model from the then existing sufficient statistics in RS, DS, and CS.

Stopping Criteria at Step 140

The scalable Expectation Maximization analysis is stopped (rather than suspended) and a resultant model output produced when the test 140 of FIG. 4 indicates the Model is good enough. Two alternate stopping criteria (other than a scan of the entire database) are used.

A first stopping criterion defines a probability function p(x) to be the quantity

$p(x) = \sum_{l = 1}^{K} \frac{M(l)}{N}\, g\!\left( x \mid l \right)$

where x is a data point or vector sampled from the database and 1) the quantity M(l) is the scalar weight for the lth cluster (the number of data elements from the database sampled so far represented by cluster l), 2) N is the total number of data points or vectors sampled thus far, and 3) g(x|l) is the probability function for the data point for the lth cluster. The value of g(x|l) is the product of the height of the Gaussian distribution for cluster l evaluated over the continuous attribute values times the product of the values of the attribute/value table associated with cluster l taking the values of the attributes appearing in x.
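
A direct transcription of g(x|l) and p(x), under the same assumed cluster representation as the earlier sketches:

    import numpy as np
    from scipy.stats import multivariate_normal

    def g(cont_x, disc_x, cluster):
        # Gaussian height over the continuous attributes times the
        # attribute/value table entries for the discrete attributes of x
        height = multivariate_normal.pdf(cont_x, mean=cluster.mean,
                                         cov=cluster.cov)
        return height * np.prod([cluster.table[i, v]
                                 for i, v in enumerate(disc_x)])

    def p(cont_x, disc_x, model, N):
        # p(x) = sum over clusters l of (M(l)/N) * g(x|l)
        return sum((c.M / N) * g(cont_x, disc_x, c) for c in model)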

Now define a function f(iter) that changes with each iteration:

$f(\mathit{iter}) = \frac{1}{M}\sum_{i = 1}^{M} \log p\!\left( x_{i} \right)$

The summation in the function is over all data points and therefore includes the subclusters in the data structure CS, the summarized data in DS, and individual data points in RS. When the values of p(x_i) are calculated, the probability function of a subcluster is determined by calculating the weighting factor in a manner similar to the calculation at step 232. Similarly, the weighting factors for the K elements of DS are calculated in a manner similar to the step 252 in FIG. 8B. Consider two computations during two successive processing loops of the FIG. 4 scalable EM analysis. Designate the calculations for these two iterations as f_z and f_(z−1), and define a difference parameter d_z = f_z − f_(z−1). Evaluate the maximum difference parameter over the last r iterations; if no difference exceeds a stopping tolerance ST, then the first stopping criterion has been satisfied and the model is output.
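
Using the p(x) helper sketched above, the first criterion might be coded as follows. Treating every point with unit weight is a simplification introduced here; as the text notes, CS and DS entries enter through their weighting factors.

    import numpy as np

    def f_iter(points, model, N):
        # f(iter) = (1/M) * sum_i log p(x_i) over all M points
        return np.mean([np.log(p(xc, xd, model, N)) for xc, xd in points])

    def first_criterion_met(f_history, r, stop_tol):
        # stop when no difference d_z = f_z - f_(z-1) over the
        # last r iterations exceeds the stopping tolerance ST
        diffs = np.abs(np.diff(f_history[-(r + 1):]))
        return len(diffs) >= r and diffs.max() < stop_tol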

A second stopping criterion is the same as the criterion outlined earlier. Each time the Model is updated, K cluster means and covariance matrices are determined and the attribute/value probability tables for the clusters are updated. The variables CV_dist, mean_dist, and Ptable_dist are initialized. For each cluster the newly determined covariance matrix, mean, and attribute/value probability table are compared with these parameters from a previous iteration. A distance between the old mean and the new mean, a distance between the new and old covariance matrices, and a distance between old and new attribute/value probability tables are determined. These values are totaled for all the clusters:

For j = 1, . . ., K
  [New_Mean, New_CVMatrix] = ConvertSuffStats(New_Model(j).SUM,
    New_Model(j).SUMSQ, New_Model(j).M);
  mean_dist = mean_dist + distance(Old_SuffStats(j).Mean, New_Mean);
  CV_dist = CV_dist + distance(Old_SuffStats(j).CVMatrix, New_CVMatrix);
  Ptable_dist = Ptable_dist + distance(Old_Model(j).Attribute_Value_Table,
    New_Model(j).Attribute_Value_Table);
End for

The stopping criterion determines whether the sum of these numbers is less than a stopping tolerance:

[(1/(3*K))*(mean_dist + CV_dist + Ptable_dist)] < stop_tol

Computer System

With reference to FIG. 1, an exemplary data processing system for practicing the disclosed data mining engine invention includes a general purpose computing device in the form of a conventional computer 20, including one or more processing units 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.

The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24.

The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

While the present invention has been described with a degree of particularity, it is the intent that the invention include all modifications and alterations from the disclosed implementations falling within the spirit or scope of the appended claims.

Appendix A

The following development of the Bonferroni inequality, used to determine the multidimensional confidence interval on the mean, follows from page 12 of [2] G. A. F. Seber, Multivariate Observations, John Wiley & Sons, New York, 1984.

A conservative procedure for determining the multidimensional confidence interval on the mean vector of a set of multivariate observations is always available using the Bonferroni inequality:

$P\left( \bigcap_{i = 1}^{r} E_{i} \right) \geq 1 - \sum_{i = 1}^{r} P\left( \bar{E}_{i} \right).$

where $\bar{E}_{i}$ is the complement of $E_{i}$. If we use the critical level of α/r for each test, then

$P\left( E_{i} \right) = 1 - \frac{\alpha}{r}, \quad P\left( \bar{E}_{i} \right) = \frac{\alpha}{r},$

$P\left( \bigcap_{i = 1}^{r} E_{i} \right) \geq 1 - \sum_{i = 1}^{r} P\left( \bar{E}_{i} \right) = 1 - r\left( \frac{\alpha}{r} \right) = 1 - \alpha.$

Hence, in our application, let $E_{j}$ be the event that the j-th element of the l-th current mean lies between the values $L_{j}^{l}$ (lower bound) and $U_{j}^{l}$ (upper bound), or specifically, $L_{j}^{l} \leq \bar{x}_{j}^{l} \leq U_{j}^{l}$. Here $\bar{x}_{j}^{l}$ is the j-th element of the l-th current mean. The values of $L_{j}^{l}$ and $U_{j}^{l}$ define the 100(1−α/r)% confidence interval on $\bar{x}_{j}^{l}$, which is computed as:

$L_{j}^{l} = \bar{x}_{j}^{l} - t_{(\frac{\alpha}{2n}),(N - 1)} \cdot \sqrt{\frac{S_{j}^{l}}{N}}, \quad U_{j}^{l} = \bar{x}_{j}^{l} + t_{(\frac{\alpha}{2n}),(N - 1)} \cdot \sqrt{\frac{S_{j}^{l}}{N}}.$

N is the number of singleton data points represented by cluster l, including those that have already been compressed in earlier iterations and uncompressed data points. $S_{j}^{l}$ is an estimate of the variance of the l-th cluster along dimension j. Let $L^{l}, U^{l} \in R^{n}$ be the vectors of lower and upper bounds on the mean of cluster l.
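
The per-coordinate bounds can be computed from a Student-t critical value; a sketch using scipy (the argument names are illustrative):

    import numpy as np
    from scipy.stats import t

    def bonferroni_bounds(xbar, s2, N, alpha, n):
        # 100*(1 - alpha/n)% interval on one coordinate of a cluster mean;
        # n is the number of simultaneous tests (here, the dimension count)
        crit = t.ppf(1 - alpha / (2 * n), df=N - 1)
        half = crit * np.sqrt(s2 / N)
        return xbar - half, xbar + half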

The invention assigns data points (the continuous attributes of data records) to Gaussians in a probabilistic fashion. Two different techniques are proposed for determining the integer N, the number of singleton data points over which the Gaussian mean is computed. The first is motivated by the EM Gaussian center update formula, which is computed over all of the data processed so far (whether it has been compressed or not); hence in the first variant of the Bonferroni CI computation we take N to be the number of data elements processed by the Scalable EM algorithm so far. The second is motivated by the fact that although the EM Gaussian center update is over all data points, each data point is assigned probabilistically to a given Gaussian in the mixture model; hence in the second variant of the Bonferroni computations we take N to be the rounded integer of the sum of the probabilistic assignments over all data points processed so far.

The Bonferroni CI formulation assumes that the Gaussian centers, computed over multiple data samples of size N, are computed as

$\frac{1}{N}\sum_{i = 1}^{N} x^{i}.$

This is true for the classic K-means algorithm, but is only guaranteed to be true for the first iteration of the EM algorithm. Hence a distribution other than the t distribution may better fit the assumptions on the distribution of the Gaussian center as computed by the EM algorithm. This would result in a different formula for the Bonferroni CI.

After determining the confidence intervals on the K Gaussian means $L^{l}, U^{l} \in R^{n}$, l=1, . . . , k, one technique perturbs the means so that the resulting situation is a "worst case scenario" for a given singleton data element. Assuming that the data point is $x^{i}$, we propose solving the following optimization problem for determining the perturbed cluster means and corresponding probabilistic assignment of data point $x^{i}$ to the K perturbed Gaussians:

$\min_{\tilde{x}^{1}, \tilde{x}^{2}, \ldots, \tilde{x}^{k}}\left\{ - \sum_{l = 1}^{k} f\left( \tilde{x}^{l} \right)\log f\left( \tilde{x}^{l} \right) \;\middle|\; \sum_{l = 1}^{k} f\left( \tilde{x}^{l} \right) = 1,\; L^{l} \leq \tilde{x}^{l} \leq U^{l},\; l = 1, \ldots, k \right\}.$

Here $f\left( \tilde{x}^{l} \right)$ is the probabilistic assignment of data point $x^{i}$ to the Gaussian centered at $\tilde{x}^{l}$; more specifically:

$f\left( \tilde{x}^{l} \right) = P\left( l \mid x^{i} \right) = \frac{p\left( x^{i} \mid l \right)P(l)}{p(x^{i})} = \frac{p\left( x^{i} \mid l \right)P(l)}{\sum_{j = 1}^{k} p\left( x^{i} \mid j \right)P(j)}, \quad \text{where}$

$p\left( x^{i} \mid l \right) = \frac{1}{(2\pi)^{n/2}\sqrt{|\tilde{S}^{l}|}}\exp\left\{ - \frac{1}{2}\left( x^{i} - \tilde{x}^{l} \right)^{T}\left( \tilde{S}^{l} \right)^{-1}\left( x^{i} - \tilde{x}^{l} \right) \right\}.$

The perturbation becomes a more general optimization problem, and the procedure used in the K-means case is a special case of the solution of this problem when 0/1 assignments are made between points and clusters.
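
One plausible way to solve this box-constrained problem numerically is a general-purpose optimizer over the stacked perturbed means; the sum-to-one constraint is satisfied automatically because the assignments f are normalized responsibilities. The sketch below is an illustration under those assumptions, not the procedure claimed in the patent; priors stands in for P(l) and covs for the cluster covariances.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import multivariate_normal

    def perturb_means(x, L, U, covs, priors):
        # L, U: (k, n) arrays of Bonferroni lower/upper bounds on the k means
        k, n = L.shape

        def resp(means):
            # normalized probabilistic assignments f of x to the k Gaussians
            dens = np.array([priors[l] *
                             multivariate_normal.pdf(x, means[l], covs[l])
                             for l in range(k)])
            return dens / dens.sum()

        def entropy(flat):
            f = resp(flat.reshape(k, n))
            return -np.sum(f * np.log(f + 1e-12))

        res = minimize(entropy, ((L + U) / 2).ravel(),
                       bounds=list(zip(L.ravel(), U.ravel())),
                       method="L-BFGS-B")
        means = res.x.reshape(k, n)
        return means, resp(means)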

We claim:
1. In a computer data processing system, a method for clustering data in a database comprising the steps of: a) reading data records having both discrete and ordered attributes from a database storage medium and bringing a portion of the data records into a rapid access memory; b) initializing a cluster model that characterizes the data within the database wherein the cluster model includes a table of probabilities for the enumerated or discrete data attributes of the data records for each cluster of a multiple number of clusters that make up the cluster model and wherein the cluster model for data attributes that are ordered comprises a mean and covariance for each cluster; c) updating the cluster model from the database records brought into the rapid access memory; d) summarizing at least some of the database records in the rapid access memory and storing a summarization within the rapid access memory; e) evaluating a criteria to determine if further data should be accessed from the database to further cluster data records in the database; and f) based on the evaluating step, reading an additional number of records from the database storage medium and bringing said additional number of records into the rapid access memory for further updating of the cluster model.
2. The method of claim 1 wherein the step of updating the cluster model includes the step of adjusting the table of discrete attribute probabilities for a cluster by calculating a weighted sum of the data records brought into the rapid access memory and a weighted sum for data records already summarized in the cluster model.
3. The method of claim 1 wherein the step of updating the cluster model includes the step of adjusting a data structure of ordered attribute mean and covariance values by calculating a weighted sum of the mean and covariance values of database records brought into the rapid access memory and the mean and covariance values for records already summarized in the cluster model.
4. The method of claim 1 wherein the step of updating the cluster model includes adjusting the ordered attribute mean and spread values and the table of discrete attribute probabilities for a cluster by calculating a weighted sum of the mean and covariance values and probability values of database records brought into the rapid access memory and the mean and covariance values and probability values for records already summarized in the cluster model.
5. The method of claim 1 wherein both the ordered and the discrete attributes are assigned a confidence interval and wherein the summarizing step summarizes certain data records based upon the confidence interval.
6. The method of claim 5 wherein the step of summarizing the database records includes the step of determining whether a data point is suitable for summarization by performing a perturbation of the cluster model probabilities and verifying that the data point is sufficiently described by the perturbed cluster model probabilities.
7. The method of claim 5 wherein the step of summarizing the database records includes the step of determining whether a data point is suitable for summarization by comparing the probability that a data point belongs to a cluster with a threshold probability value.
8. The method of claim 5 wherein the step of summarizing the database records includes the step of performing a non-scalable clustering method on the data points remaining in rapid access memory after some of the database records have been summarized and storing the results of the non-scalable clustering method in rapid access memory.
9. The method of claim 5 further including the step of constructing a model of the database based on the cluster probability tables, the summarizations of data points in rapid access memory, and data points in rapid access memory which have been neither compressed nor summarized.
10. The method of claim 9 wherein the step of constructing a model of the database comprises the steps of: a) resetting a new model data structure to zero; b) determining a weighted contribution to the new model for each unsummarized data point; and c) determining a weighted contribution of the summarized data points to the new model.
11. The method of claim 10 further comprising the steps of: a) providing an old model based upon a past modeling iteration; b) comparing the old model to the new model; and c) terminating the modeling process when the old model and new model are sufficiently similar.
12. The method of claim 1 wherein a probability that a data record belongs in a cluster for data records extracted from the database is calculated using a covariance matrix for the continuous attributes of the record.
13. The method of claim 1 wherein each cluster in the cluster model is characterized by a datapoint number for the cluster, a mean for each ordered data attribute, a covariance for each ordered data attribute and a probability table for each discrete data attribute and further wherein each data record read from the database storage medium contributes to an updating of the cluster model for at least one cluster.
14. The method of claim 13 wherein there are K clusters in the cluster model and wherein each data record contributes to a cluster model for each of the K clusters.
15. The method of claim 1 wherein the step of accessing database records is performed using a sequential scan of the database.
16. The method of claim 1 wherein the step of accessing database records is performed using a random index generator that does not repeat.
17. In a computer data mining system, apparatus for evaluating data in a database comprising: a) one or more data storage devices for storing a database of data records on a storage medium, said data records including attributes of both discrete or enumerated data and ordered data; b) a computer having a rapid access memory and an interface to the storage devices for reading data from the storage medium and bringing the data into said rapid access memory for subsequent evaluation; and c) said computer comprising a processing unit for evaluating at least some of the data records in the database and for characterizing the data records into multiple numbers of data clusters, said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data to further characterize the database clustering using a clustering criteria, and produce a summarization of at least some of the retrieved data records before retrieving additional data records from the database; said computer producing a cluster model that includes cluster probabilities for the discrete attributes and cluster means and covariance information for the ordered data in the rapid access memory during data clustering.
18. The apparatus of claim 17 wherein said processing unit updates said cluster model criteria based on said subset of said data records and previously summarized data from the database.
19. The apparatus of claim 17 wherein said processing unit is further programmed to summarize certain data and to summarize certain other data according to subclusters having means, covariances, and probability tables characterizing each of said subclusters.
20. The apparatus of claim 17 further comprising an output means for outputting a model of said database created by said characterization of data into clusters.
21. A computer readable medium having stored thereon a data structure, comprising: a) a first data portion containing a model representation of data records stored in a database, wherein at least some of the database records include mixed data that includes both discrete data fields and continuous data fields; b) a second data portion containing sufficient statistics of a portion of the data records in the database; and c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion.
22. The data structure of claim 21 wherein said model representation comprises a set of clusters to which data records may be assigned based on the degree to which each cluster describes the data record.
23. The data structure of claim 22 wherein each of said clusters is represented by a datapoint number, a mean for each ordered attribute, a spread for each ordered attribute, and a probability table for each discrete attribute.
24. The data structure of claim 22 wherein each of said data records may be assigned to only one of said clusters.
25. The data structure of claim 21 wherein a second data portion containing sufficient statistics of a portion of the data records in the database is organized by cluster and includes a data record number, a mean for each ordered attribute, a spread for each ordered attribute, and a probability table for each discrete attribute of those records summarized in said sufficient statistics.
26. The data structure of claim 22 wherein each of said data records may be assigned to each of said clusters with a probability based on the degree to which a given cluster describes said data record.
27. The data structure of claim 21 wherein said sufficient statistics comprises a set of clusters to which data records may be assigned based on the degree to which each cluster describes the data record.
28. The data structure of claim 27 wherein the sufficient statistics for each of said clusters is represented by a datapoint number, a mean for each ordered attribute, a spread for each ordered attribute, and a probability table for each discrete attribute, which in combination with individual data records is used to produce the cluster model representation.
29. The data structure of claim 28 wherein each of said data records may be summarized with a data summarization associated with a data cluster or may also be assigned to a data summarization associated with a subcluster or may be left as a vector data record.
30. A computer-readable medium having computer-executable components comprising: a) a database component for interfacing with a database that stores data records containing both enumerated or discrete and ordered values; b) a rapid access memory component for storing at least a subset of said data records gathered from the database for processing; c) a modeling component for constructing and storing a model of said database by determining if a data record is sufficiently described by any of several clusters using cluster criteria and for updating said cluster model based on said data records and for evaluating whether further of said data records should be moved from said database into said rapid access memory for modeling.
31. The computer readable medium of claim 30 wherein said database component is adapted to store and said modeling component to construct a model of data records containing both enumerated or discrete and ordered values.
32. The computer readable medium of claim 30 wherein said modeling component is adapted to store said model of said database in the form of a datapoint number, a table of means, a table of spreads, and a table of probabilities for each cluster of said cluster model.
33. The computer readable medium of claim 30 wherein said modeling component is adapted to compare a new model to a previously constructed model to evaluate whether further of said data records should be moved from said database into said rapid access memory for modeling.
34. The computer readable medium of claim 30 wherein said modeling component is adapted to update said cluster model by calculating a weighted contribution by each of said data records in said rapid access memory.